Abstract

Transformation-based learning has been successfully 
employed to solve many natural language process- 
ing problems. It achieves state-of-the-art perfor- 
mance on many natural language processing tasks 
and does not overtrain easily. However, it does have 
a serious drawback: the training time is often
intolerably long, especially on the large corpora which
are often used in NLP. In this paper, we present a 
novel and realistic method for speeding up the train- 
ing time of a transformation-based learner without 
sacrificing performance. The paper compares and 
contrasts the training time needed and performance 
achieved by our modified learner with two other 
systems: a standard transformation-based learner,
and the ICA system (Hepple, 2000). The results of
these experiments show that our system is able to 
achieve a significant improvement in training time 
while still achieving the same performance as a stan- 
dard transformation-based learner. This is a valu- 
able contribution to systems and algorithms which 
utilize transformation-based learning at any part of 
the execution. 

1 Introduction 

Much research in natural language processing has 
gone into the development of rule-based machine 
learning algorithms. These algorithms are attractive 
because they often capture the linguistic features of 

a corpus in a small and concise set of rules. 

Transformation-based learning (TBL) (Brill, 1995) is
one of the most successful rule-based machine learning
algorithms. It is a flexible method which is easily
extended to various tasks and domains, and it has been
applied to a wide variety of NLP tasks, including
part-of-speech tagging (Brill, 1995), noun phrase
chunking (Ramshaw and Marcus, 1999), parsing (Brill,
1996), phrase chunking (Florian et al., 2000), spelling
correction (Mangu and Brill, 1997), prepositional phrase
attachment (Brill and Resnik, 1994), dialog act tagging
(Samuel et al., 1998), segmentation and message
understanding (Day et al., 1997). Furthermore,
transformation-based learning achieves state-of-the-art
performance on several tasks, and is fairly resistant to
overtraining (Ramshaw and Marcus, 1994).

Despite its attractive features as a machine learn- 
ing algorithm, TBL does have a serious draw- 
back in its lengthy training time, especially on the 
larger-sized corpora often used in NLP tasks. For 
example, a well-implemented transformation-based 
part-of-speech tagger will typically take over 38 
hours to finish training on a 1 million word cor- 
pus. This disadvantage is further exacerbated when 
the transformation-based learner is used as the base 
learner in learning algorithms such as boosting or 
active learning, both of which require multiple it- 
erations of estimation and application of the base 
learner. In this paper, we present a novel method 
which enables a transformation-based learner to re- 
duce its training time dramatically while still retain- 
ing all of its learning power. In addition, we will 
show that our method scales better with training 
data size. 

2 Transformation-based Learning 

The central idea of transformation-based learning 
(TBL) is to learn an ordered list of rules which 
progressively improve upon the current state of the 
training set. An initial assignment is made based on 
simple statistics, and then rules are greedily learned 
to correct the mistakes, until no net improvement 
can be made. 

The following definitions and notations will be 
used throughout the paper: 

• The sample space is denoted by $S$;

• $C$ denotes the set of possible classifications of
the samples;

• $C[s]$ denotes the classification associated with a
sample $s$, and $T[s]$ denotes the true classification
of $s$;

• $p$ will usually denote a predicate defined on $S$;

• A rule $r$ is defined as a predicate - class label
pair, $(p, t)$, where $t \in C$ is called the target of $r$;

• $\mathcal{R}$ denotes the set of all rules;

• If $r = (p, t)$, $p_r$ will denote $p$ and $t_r$ will denote $t$;

• A rule $r = (p_r, t_r)$ applies to a sample $s$ if
$p_r(s) = \text{true}$ and $t_r \neq C[s]$; the resulting
sample is denoted by $r(s)$.

Using the TBL framework to solve a problem as- 
sumes the existence of: 

• An initial class assignment. This can be as sim- 
ple as the most common class label in the train- 
ing set, or it can be the output of another clas- 
sifier. 

• A set of allowable templates for rules. These 
templates determine the types of predicates the 
rules will test; they have the largest impact on 
the behavior of the system. 

• An objective function $f$ for learning. Unlike
in many other learning algorithms, the objective
function for TBL will directly optimize the
evaluation function. A typical example is the
difference in performance resulting from applying
the rule:

$$f(r) = good(r) - bad(r)$$

where

$$good(r) = |\{s \mid C[s] \neq T[s] \wedge C[r(s)] = T[s]\}|$$
$$bad(r) = |\{s \mid C[s] = T[s] \wedge C[r(s)] \neq T[s]\}|$$
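The objective can be made concrete with a small sketch. It assumes a hypothetical `Rule` type (a predicate over the current tags and a position, plus a target label) and a toy tag sequence; none of these names or data come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    # Hypothetical representation: a predicate over (tags, position) and a target label.
    predicate: Callable[[Sequence[str], int], bool]
    target: str


def applies(rule: Rule, tags: Sequence[str], i: int) -> bool:
    # A rule applies when its predicate holds and it would actually change the label.
    return rule.predicate(tags, i) and tags[i] != rule.target


def objective(rule: Rule, tags: Sequence[str], truth: Sequence[str]) -> Tuple[int, int, int]:
    """Return (good, bad, f) for one rule, with f = good - bad."""
    good = bad = 0
    for i in range(len(tags)):
        if not applies(rule, tags, i):
            continue
        if rule.target == truth[i]:
            good += 1   # wrong before the rule, correct after it
        elif tags[i] == truth[i]:
            bad += 1    # correct before the rule, wrong after it
    return good, bad, good - bad


# Toy rule: "change the tag to NN if the previous tag is DT".
prev_is_dt = Rule(predicate=lambda tags, i: i > 0 and tags[i - 1] == "DT", target="NN")
current = ["DT", "VB", "PRP", "VB", "DT", "VB", "DT", "JJ"]
truth = ["DT", "NN", "PRP", "VB", "DT", "NN", "DT", "JJ"]
print(objective(prev_is_dt, current, truth))
```

On this toy data the rule repairs two errors and breaks one correct tag, so its score is 1.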

Since we are not interested in rules that have a nega- 
tive objective function value, only the rules that have 
a positive good (r) need be examined. This leads to 
the following approach: 

1. Generate the rules (using the rule template set)
that correct at least one error (i.e. $good(r) > 0$),
by examining all the incorrect samples ($s$ s.t.
$C[s] \neq T[s]$);

2. Compute the values $bad(\cdot)$ for each rule $r$ such
that $good(r) > f(b)$, storing at each point in
time the rule $b$ that has the highest score; while
computing $bad(r)$, skip to the next rule when
$f(r) < f(b)$.

The system thus learns a list of rules in a greedy
fashion, according to the objective function. When 
no rule that improves the current state of the train- 
ing set beyond a pre-set threshold can be found, the 
training phase ends. During the application phase, 
the evaluation set is initialized with the initial class 
assignment. The rules are then applied sequentially
to the evaluation set in the order they were learned. 
The final classification is the one attained when all 
rules have been applied. 
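The training and application phases described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single hypothetical rule template ("change the tag to `target` if the previous tag is `prev`"), applies a rule to all matching positions at once, and rescores every candidate from scratch at each iteration, which is exactly the cost the rest of the paper sets out to avoid.

```python
def apply_rule(rule, tags):
    """Apply one (prev_tag, target) rule to every matching position."""
    prev, target = rule
    # Positions are chosen against the tags as they stood before the update.
    hits = [i for i in range(1, len(tags)) if tags[i - 1] == prev and tags[i] != target]
    out = list(tags)
    for i in hits:
        out[i] = target
    return out


def score(rule, tags, truth):
    """f(r) = good(r) - bad(r), recomputed from scratch."""
    new = apply_rule(rule, tags)
    good = sum(1 for o, n, t in zip(tags, new, truth) if o != t and n == t)
    bad = sum(1 for o, n, t in zip(tags, new, truth) if o == t and n != t)
    return good - bad


def candidate_rules(tags, truth):
    """Only rules that correct at least one error are generated."""
    return {(tags[i - 1], truth[i]) for i in range(1, len(tags)) if tags[i] != truth[i]}


def train(initial, truth, threshold=1):
    tags, learned = list(initial), []
    while True:
        candidates = candidate_rules(tags, truth)
        if not candidates:
            break
        best = max(sorted(candidates), key=lambda r: score(r, tags, truth))
        if score(best, tags, truth) < threshold:
            break
        learned.append(best)
        tags = apply_rule(best, tags)
    return learned


def tag(initial, rules):
    """Application phase: apply the rules in the order they were learned."""
    tags = list(initial)
    for rule in rules:
        tags = apply_rule(rule, tags)
    return tags


truth = ["DT", "NN", "VB", "DT", "NN"]
initial = ["DT", "VB", "VB", "DT", "VB"]
rules = train(initial, truth)
print(rules, tag(initial, rules))
```

Because every learned rule has a score of at least 1, the number of errors strictly decreases and the loop terminates.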

2.1 Previous Work 

As was described in the introductory section, the 
long training time of TBL poses a serious prob- 
lem. Various methods have been investigated to- 
wards ameliorating this problem, and the following 
subsections detail two of the approaches. 



2.1.1 The Ramshaw & Marcus Approach

One of the most time-consuming steps in 
transformation-based learning is the updating 
step. The iterative nature of the algorithm requires 
that each newly selected rule be applied to the
corpus, and the current state of the corpus updated 
before the next rule is learned. 



Ramshaw & Marcus (1994) attempted to reduce 
the training time of the algorithm by making the up- 
date process more efficient. Their method requires 
each rule to store a list of pointers to samples that 
it applies to, and for each sample to keep a list of 
pointers to rules that apply to it. Given these two 
sets of lists, the system can then easily: 

1. identify the positions where the best rule applies 
in the corpus; and 

2. update the scores of all the rules which are af- 
fected by a state change in the corpus. 

These two processes are performed multiple times 
during the update process, and the modification re- 
sults in a significant reduction in running time. 

The disadvantage of this method is its
unrealistically high memory requirement. For example, a
transformation-based text chunker training upon a
modestly-sized corpus of 200,000 words has approximately
2 million rules active at each iteration. The additional
memory space required to store the lists of pointers
associated with these rules is about 450 MB, which is a
rather large requirement to add to a system.^1

2.1.2 The ICA Approach 



The ICA system (Hepple, 2000) aims to reduce the
training time by introducing independence assumptions
on the training samples; these dramatically speed up
training, with the possible downside of
sacrificing performance.

To achieve the speedup, the ICA system disallows 
any interaction between the learned rules, by enforc- 
ing the following two assumptions: 

• Sample Independence: a state change in a
sample (e.g. a change in the current part- 
of-speech tag of a word) does not change the 
context of surrounding samples. This is cer- 
tainly the case in tasks such as prepositional 
phrase attachment, where samples are mutually 
independent. Even for tasks such as part-of- 
speech tagging where intuition suggests it does 
not hold, it may still be a reasonable assump- 
tion to make if the rules apply infrequently and 
sparsely enough. 



^1 We need to note that the 200k-word corpus used in this
experiment is considered small by NLP standards. Many of 
the available corpora contain over 1 million words. As the 
size of the corpus increases, so does the number of rules and 
the additional memory space required. 



• Rule Commitment: there will be at most one
state change per sample. In other words, at
most one rule is allowed to apply to each sample.
This mode of application is similar to that of a
decision list (Rivest, 1987), where a sample is
modified by the first rule that applies to it, and
not modified again thereafter. In general, this
assumption will hold for problems which have
high initial accuracy and where state changes
are infrequent.
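The effect of the two assumptions on rule application can be illustrated with a short sketch. It is one reading of the description above, not Hepple's implementation: rules are hypothetical `(prev_tag, target)` pairs, contexts are read from the initial tags (sample independence), and a position is frozen after its first change (rule commitment). Standard sequential TBL application is included for contrast.

```python
def apply_sequentially(initial, rules):
    """Standard TBL application: every rule sees the current state."""
    tags = list(initial)
    for prev, target in rules:
        hits = [i for i in range(1, len(tags)) if tags[i - 1] == prev and tags[i] != target]
        for i in hits:
            tags[i] = target
    return tags


def apply_with_commitment(initial, rules):
    """Decision-list style application under both assumptions."""
    tags = list(initial)
    committed = [False] * len(tags)
    for prev, target in rules:
        for i in range(1, len(tags)):
            # Sample independence: the context is read from the initial tags.
            if not committed[i] and initial[i - 1] == prev and tags[i] != target:
                tags[i] = target
                committed[i] = True  # rule commitment: no further changes here
    return tags


initial = ["DT", "VB", "VB"]
rules = [("DT", "NN"), ("DT", "JJ")]
print(apply_sequentially(initial, rules))     # the second rule overrides the first
print(apply_with_commitment(initial, rules))  # the first rule's change is final
```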

The ICA system was designed and tested on the
task of part-of-speech tagging, achieving an impressive
reduction in training time while suffering only
a small decrease in accuracy. The experiments presented
in Section 4 include ICA in the training time
and performance comparisons.^2

2.1.3 Other Approaches 



Samuel (1998) proposed a Monte Carlo approach
to transformation-based learning, in which only a
fraction of the possible rules are randomly selected
for estimation at each iteration. The μ-TBL system
described in Lager (1999) attempts to cut down
on training time with a more efficient Prolog
implementation and an implementation of "lazy" learning.
The application of a transformation-based learner
can be considerably sped up if the rules are compiled
into a finite-state transducer, as described in Roche
and Schabes (1995).



3 The Algorithm 

The approach presented here builds on the same
foundation as the one in (Ramshaw and Marcus,
1994): instead of regenerating the rules each time,
they are stored in memory, together with the two
values $good(r)$ and $bad(r)$.

The following notations will be used throughout 
this section: 

• $G(r) = \{s \in S \mid p_r(s) = \text{true} \wedge C[s] \neq t_r \wedge t_r = T[s]\}$:
the samples on which the rule applies and changes them
to the correct classification; therefore, $good(r) = |G(r)|$.

• $B(r) = \{s \in S \mid p_r(s) = \text{true} \wedge C[s] \neq t_r \wedge C[s] = T[s]\}$:
the samples on which the rule applies and changes the
classification from correct to incorrect; similarly,
$bad(r) = |B(r)|$.

Given a newly learned rule $b$ that is to be applied
to $S$, the goal is to identify the rules $r$ for which at
least one of the sets $G(r)$, $B(r)$ is modified by the
application of rule $b$. Obviously, if neither set is
modified when applying rule $b$, then the value of the
objective function for rule $r$ remains unchanged.



The presentation is complicated by the fact that, 
in many NLP tasks, the samples are not indepen- 
dent. For instance, in POS tagging, a sample is de- 
pendent on the classification of the preceding and 
succeeding 2 samples (this assumes that there exists
a natural ordering of the samples in $S$). Let
$V(s)$ denote the "vicinity" of a sample: the set of
samples on whose classification the sample $s$ might
depend (for consistency, $s \in V(s)$); if samples are
independent, then $V(s) = \{s\}$.
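For a sequence task the vicinity is simply a window around the sample. The following sketch assumes samples are indexed by position and that the templates look at most `window` positions to either side (2 for the POS tagging example above, 0 for independent samples); the function names are illustrative.

```python
def vicinity(i, n, window=2):
    """Positions whose classification sample i may depend on, clipped to [0, n)."""
    return set(range(max(0, i - window), min(n, i + window + 1)))


def affected(changed, n, window=2):
    """Union of the vicinities of all samples whose classification changed."""
    out = set()
    for i in changed:
        out |= vicinity(i, n, window)
    return out


print(sorted(vicinity(5, 10)))        # a sample in the middle of a 10-sample corpus
print(sorted(vicinity(4, 10, 0)))     # independent samples: V(s) = {s}
print(sorted(affected([0, 9], 10)))   # samples that need re-examination
```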

3.1 Generating the Rules 

Let $s$ be a sample on which the best rule $b$ applies
(i.e. $C[b(s)] \neq C[s]$). We need to identify the rules
$r$ that are influenced by the change $s \to b(s)$. Let
$r$ be such a rule. $f(r)$ needs to be updated if and
only if there exists at least one sample $s'$ such that

$$s' \in G(r) \text{ and } b(s') \notin G(r) \quad \text{or} \tag{1}$$
$$s' \in B(r) \text{ and } b(s') \notin B(r) \quad \text{or} \tag{2}$$
$$s' \notin G(r) \text{ and } b(s') \in G(r) \quad \text{or} \tag{3}$$
$$s' \notin B(r) \text{ and } b(s') \in B(r) \tag{4}$$

Each of the above conditions corresponds to a specific
update of the $good(r)$ or $bad(r)$ counts. We
will discuss how rules which should get their good or
bad counts decremented (subcases (1) and (2)) can
be generated, the other two being derived in a very
similar fashion.
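The four conditions amount to comparing set membership before and after the change. As a hedged sketch, the helper below takes, for one sample $s'$ and one rule with target `t_r`, whether the rule's predicate holds and what the current label is, both before and after applying $b$; the function and argument names are assumptions made for illustration.

```python
def in_G(pred_holds, current, t_r, truth):
    # Membership in G(r): the rule applies and produces the correct label.
    return pred_holds and current != t_r and t_r == truth


def in_B(pred_holds, current, t_r, truth):
    # Membership in B(r): the rule applies and destroys a correct label.
    return pred_holds and current != t_r and current == truth


def count_deltas(before, after, t_r, truth):
    """before/after are (pred_holds, current_label) pairs for one sample.

    Returns (change in good(r), change in bad(r)) contributed by this sample:
    -1 corresponds to conditions (1)/(2), +1 to conditions (3)/(4).
    """
    d_good = int(in_G(*after, t_r, truth)) - int(in_G(*before, t_r, truth))
    d_bad = int(in_B(*after, t_r, truth)) - int(in_B(*before, t_r, truth))
    return d_good, d_bad


# Condition (1): the predicate stops holding on a sample the rule used to fix.
print(count_deltas((True, "VB"), (False, "VB"), t_r="NN", truth="NN"))
# Condition (4): the predicate starts holding on a correctly labelled sample.
print(count_deltas((False, "VB"), (True, "VB"), t_r="NN", truth="VB"))
```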

The key observation behind the proposed algorithm
is: when investigating the effects of applying
the rule $b$ to sample $s$, only samples $s'$ in the set
$V(s)$ need to be checked. Any sample $s'$ that is not
in the set

$$\bigcup_{\{s \mid b \text{ changes } s\}} V(s)$$

can be ignored since $s' = b(s')$.

Let $s' \in V(s)$ be a sample in the vicinity of $s$.
There are 2 cases to be examined: one in which $b$
applies to $s'$ and one in which $b$ does not:

Case I: $C[s'] = C[b(s')]$ ($b$ does not modify the
classification of sample $s'$). We note that the
condition

$$s' \in G(r) \text{ and } b(s') \notin G(r)$$

is equivalent to

$$p_r(s') = \text{true} \wedge C[s'] \neq t_r \wedge t_r = T[s'] \wedge p_r(b(s')) = \text{false} \tag{5}$$

and the formula

$$s' \in B(r) \text{ and } b(s') \notin B(r)$$

is equivalent to

$$p_r(s') = \text{true} \wedge C[s'] \neq t_r \wedge C[s'] = T[s'] \wedge p_r(b(s')) = \text{false} \tag{6}$$

(for the full details of the derivation, inferred from
the definition of $G(r)$ and $B(r)$, please refer to
Florian and Ngai (2001)).

^2 The algorithm was implemented by the authors, following
the description in Hepple (2000).

These formulae offer us a method of generating
the rules $r$ which are influenced by the modification
$s' \to b(s')$:

1. Generate all predicates $p$ (using the predicate
templates) that are true on the sample $s'$.

2. If $C[s'] \neq T[s']$ then

(a) If $p(b(s')) = \text{false}$ then decrease $good(r)$,
where $r$ is the rule created with predicate
$p$ and target $T[s']$;

3. Else

(a) If $p(b(s')) = \text{false}$ then for all the rules
$r$ whose predicate is $p$^3 and $t_r \neq C[s']$
decrease $bad(r)$;

The algorithm for generating the rules $r$ that need
their good counts (formula (3)) or bad counts (formula
(4)) increased can be obtained from the formulae
(1) (respectively (2)), by switching the states $s'$
and $b(s')$, and making sure to add all the new possible
rules that might be generated (only for (3)).

Case II: $C[s'] \neq C[b(s')]$ ($b$ does change the
classification of sample $s'$). In this case, the formula (5)
is transformed into:

$$p_r(s') = \text{true} \wedge C[s'] \neq t_r \wedge t_r = T[s'] \wedge \left(p_r(b(s')) = \text{false} \vee t_r = C[b(s')]\right) \tag{7}$$

(again, the full derivation is presented in Florian and
Ngai (2001)). The case of (2), however, is much
simpler. It is easy to notice that $C[s'] \neq C[b(s')]$
and $s' \in B(r)$ implies that $b(s') \notin B(r)$; indeed,
a necessary condition for a sample $s'$ to be in a set
$B(r)$ is that $s'$ is classified correctly, $C[s'] = T[s']$.
Since $C[s'] \neq C[b(s')]$, it follows that $C[b(s')] \neq T[s']$ and
therefore $b(s') \notin B(r)$. Condition (2) is, therefore,
equivalent to

$$p_r(s') = \text{true} \wedge C[s'] \neq t_r \wedge C[s'] = T[s'] \tag{8}$$

The algorithm is modified by replacing the test
$p(b(s')) = \text{false}$ with the test $p_r(b(s')) = \text{false} \vee
C[b(s')] = t_r$ in formula (1) and removing the test
altogether for the case of (2). The formulae used to
generate rules $r$ that might have their counts increased
(equations (3) and (4)) are obtained in the same
fashion as in Case I.
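The equivalences claimed for Case I (formulae (5) and (6)) and Case II (formulae (7) and (8)) can be checked by brute force over a small label set. The sketch below is only a sanity check under assumed names: `p0`/`p1` stand for $p_r(s')$ and $p_r(b(s'))$, and `c0`/`c1` for $C[s']$ and $C[b(s')]$.

```python
from itertools import product

LABELS = ("A", "B", "C")  # arbitrary toy label set


def in_G(p, c, t_r, truth):
    return p and c != t_r and t_r == truth


def in_B(p, c, t_r, truth):
    return p and c != t_r and c == truth


def check_equivalences():
    for p0, p1, c0, c1, t_r, truth in product(
            [False, True], [False, True], LABELS, LABELS, LABELS, LABELS):
        cond1 = in_G(p0, c0, t_r, truth) and not in_G(p1, c1, t_r, truth)
        cond2 = in_B(p0, c0, t_r, truth) and not in_B(p1, c1, t_r, truth)
        base_g = p0 and c0 != t_r and t_r == truth
        base_b = p0 and c0 != t_r and c0 == truth
        if c0 == c1:   # Case I: b leaves the classification of s' unchanged
            assert cond1 == (base_g and not p1)                   # formula (5)
            assert cond2 == (base_b and not p1)                   # formula (6)
        else:          # Case II: b changes the classification of s'
            assert cond1 == (base_g and (not p1 or t_r == c1))    # formula (7)
            assert cond2 == base_b                                # formula (8)
    return True


print(check_equivalences())
```

The check passes for every combination, which agrees with the derivation: in Case II a sample in $B(r)$ always leaves it, so no test on the predicate is needed.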

3.2 The Full Picture 

At every point in the algorithm, we assumed that all
the rules that have at least some positive outcome
($good(r) > 0$) are stored, and their score computed.

^3 This can be done efficiently with an appropriate data
structure - for example, using a double hash.
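One way to read the "double hash" is a two-level map from predicate to target to counts, so that all rules sharing a predicate can be enumerated directly. The class below is a sketch of that reading; `RuleTable` and its methods are hypothetical names, not the authors' data structure.

```python
from collections import defaultdict


class RuleTable:
    """Two-level hash: predicate -> target -> [good, bad]."""

    def __init__(self):
        self._by_predicate = defaultdict(dict)

    def counts(self, predicate, target):
        # Creates the rule entry on first access.
        return self._by_predicate[predicate].setdefault(target, [0, 0])

    def add_good(self, predicate, target, delta=1):
        self.counts(predicate, target)[0] += delta

    def add_bad(self, predicate, target, delta=1):
        self.counts(predicate, target)[1] += delta

    def targets_for(self, predicate):
        """Targets of all stored rules that share this predicate."""
        return sorted(self._by_predicate.get(predicate, {}))

    def score(self, predicate, target):
        good, bad = self.counts(predicate, target)
        return good - bad


table = RuleTable()
table.add_good(("prev", "DT"), "NN", 2)
table.add_bad(("prev", "DT"), "NN")
table.add_good(("prev", "DT"), "JJ")
print(table.targets_for(("prev", "DT")), table.score(("prev", "DT"), "NN"))
```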



For all samples $s$ that satisfy $C[s] \neq T[s]$, generate all rules
$r$ that correct the classification of $s$; increase $good(r)$.
For all samples $s$ that satisfy $C[s] = T[s]$ generate all
predicates $p$ s.t. $p(s) = \text{true}$; for each rule $r$ s.t. $p_r = p$ and
$t_r \neq C[s]$ increase $bad(r)$.
1: Find the rule $b = \arg\max_{r \in \mathcal{R}} f(r)$.
If ($f(b) <$ Threshold or corpus learned to completion) then
quit.
For each predicate $p$, let $\mathcal{R}(p)$ be the rules whose predicate
is $p$ ($p_r = p$).
For each samples $s$, $s'$ s.t. $C[s] \neq C[b(s)]$ and $s' \in V(s)$:
  If $C[s'] = C[b(s')]$ then
    • for each predicate $p$ s.t. $p(s') = \text{true}$
        - If $C[s'] \neq T[s']$ then
            * If $p(b(s')) = \text{false}$ then decrease $good(r)$,
              where $r = [p, T[s']]$, the rule created with
              predicate $p$ and target $T[s']$;
        - Else
            * If $p(b(s')) = \text{false}$ then for all the rules
              $r \in \mathcal{R}(p)$ s.t. $t_r \neq C[s']$ decrease $bad(r)$;
    • for each predicate $p$ s.t. $p(b(s')) = \text{true}$
        - If $C[b(s')] \neq T[s']$ then
            * If $p(s') = \text{false}$ then increase $good(r)$,
              where $r = [p, T[s']]$;
        - Else
            * If $p(s') = \text{false}$ then for all rules $r \in \mathcal{R}(p)$
              s.t. $t_r \neq C[b(s')]$ increase $bad(r)$;
  Else
    • for each predicate $p$ s.t. $p(s') = \text{true}$
        - If $C[s'] \neq T[s']$ then
            * If $p(b(s')) = \text{false} \vee C[b(s')] = t_r$ then
              decrease $good(r)$, where $r = [p, T[s']]$;
        - Else
            * For all the rules $r \in \mathcal{R}(p)$ s.t. $t_r \neq C[s']$
              decrease $bad(r)$;
    • for each predicate $p$ s.t. $p(b(s')) = \text{true}$
        - If $C[b(s')] \neq T[s']$ then
            * If $p(s') = \text{false} \vee C[s'] = t_r$ then increase
              $good(r)$, where $r = [p, T[s']]$;
        - Else
            * For all rules $r \in \mathcal{R}(p)$ s.t. $t_r \neq C[b(s')]$
              increase $bad(r)$;
Repeat from step 1:

Figure 1: FastTBL Algorithm
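The central idea of Figure 1, updating counts only in the vicinity of changed samples rather than recomputing them, can be sketched in a simplified form. The code below is not the FastTBL algorithm as published: it assumes two hypothetical templates (previous tag, next tag), keeps bad counts for every predicate/target pair rather than only for rules with positive good counts, and removes and re-adds the contributions of each affected sample instead of testing the individual subcases. The final comparison against a full recount is there to show that the incremental update gives the same counts.

```python
from collections import Counter

LABELS = ("DT", "NN", "VB")


def predicates(tags, i):
    """Predicates true at position i under the two assumed templates."""
    preds = []
    if i > 0:
        preds.append(("prev", tags[i - 1]))
    if i + 1 < len(tags):
        preds.append(("next", tags[i + 1]))
    return preds


def contribute(good, bad, tags, truth, i, sign):
    """Add (sign=+1) or remove (sign=-1) the contribution of sample i."""
    for p in predicates(tags, i):
        if tags[i] != truth[i]:
            good[(p, truth[i])] += sign      # rule (p, T[s]) would fix s
        else:
            for t in LABELS:
                if t != tags[i]:
                    bad[(p, t)] += sign      # rule (p, t) would break s


def count_all(tags, truth):
    good, bad = Counter(), Counter()
    for i in range(len(tags)):
        contribute(good, bad, tags, truth, i, +1)
    return good, bad


def apply_and_update(rule, tags, truth, good, bad):
    """Apply the rule in place and update counts only near changed samples."""
    predicate, target = rule
    changed = [i for i in range(len(tags))
               if predicate in predicates(tags, i) and tags[i] != target]
    # Templates look one position left and right, so V(s) is a window of 1.
    affected = sorted({j for i in changed
                       for j in range(max(0, i - 1), min(len(tags), i + 2))})
    for j in affected:
        contribute(good, bad, tags, truth, j, -1)
    for i in changed:
        tags[i] = target
    for j in affected:
        contribute(good, bad, tags, truth, j, +1)
    return changed


truth = ["DT", "NN", "VB", "DT", "NN", "VB", "VB", "VB"]
tags = ["DT", "VB", "VB", "DT", "NN", "VB", "VB", "VB"]
good, bad = count_all(tags, truth)
best = max(sorted(good), key=lambda r: good[r] - bad[r])
changed = apply_and_update(best, tags, truth, good, bad)
fresh_good, fresh_bad = count_all(tags, truth)
# Unary plus drops zero entries before comparing the counters.
print(best, changed, +good == +fresh_good, +bad == +fresh_bad)
```

Only three of the eight samples are revisited in this example; the remaining five keep their contributions untouched, which is where the savings come from on a large corpus.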



Therefore, at the beginning of the algorithm, all the 
rules that correct at least one wrong classification 
need to be generated. The bad counts for these rules 
are then computed by generation as well: in every 
position that has the correct classification, the rules 
that change the classification are generated, as in 
Case I, and their bad counts are incremented. The
entire FastTBL algorithm is presented in Figure 1.
Note that, when the bad counts are computed, only 
rules that already have positive good counts are se- 
lected for evaluation. This prevents the generation 
of useless rules and saves computational time. 

The number of examined rules is kept close to the
minimum. Because of the way the rules are generated,
most of them do need to modify one of
their counts. Some additional space (besides the space
needed to represent the rules) is necessary for
representing the rules in a predicate hash, in order to
have straightforward access to all rules that have a
given predicate; this amount is considerably smaller
than the one used to represent the rules. For example,
in the case of the text chunking task described in
Section 4.3, only approximately 30 MB of additional
memory is required, while the approach of Ramshaw and
Marcus (1994) would require approximately 450 MB.



3.3 Behavior of the Algorithm 

As mentioned before, the original algorithm has a 
number of deficiencies that cause it to run slowly. 
Among them is the drastic slowdown in rule learning 
as the scores of the rules decrease. When the best 
rule has a high score, which places it outside the tail 
of the score distribution, the rules in the tail will be 
skipped when the bad counts are calculated, since 
their good counts are small enough to cause them 
to be discarded. However, when the best rule is in 
the tail, many other rules with similar scores can no 
longer be discarded and their bad counts need to be 
computed, leading to a progressively longer running 
time per iteration. 

Our algorithm does not suffer from the same prob- 
lem, because the counts are updated (rather than 
recomputed) at each iteration, and only for the sam- 
ples that were affected by the application of the lat- 
est rule learned. Since the number of affected sam- 
ples decreases as learning progresses, our algorithm 
actually speeds up considerably towards the end of 
the training phase. Considering that the number
of low-score rules is considerably higher than the
number of high-score rules, this leads to a dramatic
reduction in the overall running time.

This has repercussions on the scalability of the al- 
gorithm relative to training data size. Since enlarg- 
ing the training data size results in a longer score dis- 
tribution tail, our algorithm is expected to achieve 
an even more substantial relative running time
improvement over the original algorithm. Section 4.4
presents experimental results that validate the
superior scalability of the FastTBL algorithm.

4 Experiments 

Since the goal of this paper is to compare and con- 
trast system training time and performance, extra 
measures were taken to ensure fairness in the com- 
parisons. To minimize implementation differences, 
all the code was written in C++ and classes were 
shared among the systems whenever possible. For 
each task, the same training set was provided to each 
system, and the set of possible rule templates was 
kept the same. Furthermore, extra care was taken 
to run all comparable experiments on the same ma- 
chine and under the same memory and processor 
load conditions. 

To provide a broad comparison between the sys- 
tems, three NLP tasks with different properties 
were chosen as the experimental domains. The 



first task, part-of-speech tagging, is one where the 
commitment assumption seems intuitively valid and 
the samples are not independent. The second 
task, prepositional phrase attachment, has examples 
which are independent from each other. The last
task is text chunking, where neither the independence
nor the commitment assumption seems to be valid.
A more detailed description of each task, the data and
the system parameters is presented in the following
subsections.

Four algorithms are compared during the follow- 
ing experiments: 

• The regular TBL, as described in Section 2;

• An improved version of TBL, which makes extensive
use of indexes to speed up the rules' update;

• The FastTBL algorithm;

• The ICA algorithm (Hepple, 2000).



4.1 Part-of-Speech Tagging 

The goal of this task is to assign to each word 
in the given sentence a tag corresponding to its 
part of speech. A multitude of approaches have 
been proposed to solve this problem, including 
transformation-based learning, Maximum Entropy
models, Hidden Markov models and memory-based
approaches.

The data used in the experiment was selected from
the Penn Treebank Wall Street Journal, and is the
same used by Brill and Wu (1998). The training set
contained approximately 1M words and the test set
approximately 200k words.

Table 1 presents the results of the experiment.^4
All the algorithms were trained until a rule with
a score of 2 was reached. The FastTBL algorithm
performs very similarly to the regular TBL, while
running roughly 130 times faster. The two
assumptions made by the ICA algorithm result in
considerably less training time, but the performance
is also degraded (the difference in performance is
statistically significant, as determined by a signed test,
at a significance level of 0.001). Also present in
Table 1 are the results of training Brill's tagger on the
same data. The results of this tagger are presented
to provide a performance comparison with a widely
used tagger. Also worth mentioning is that the tagger
achieved an accuracy of 96.76% when trained on
the entire data^5; a Maximum Entropy tagger
(Ratnaparkhi, 1996) achieves 96.83% accuracy with the
same training data/test data.

^4 The time shown is the combined running time for both
the lexical tagger and the contextual tagger.

^5 We followed the setup from Brill's tagger: the contextual
tagger is trained only on half of the training data. The
training time on the entire data was approximately 51 minutes.





System          Accuracy   Running time         Time ratio
Brill's tagger  96.61%     5879 mins, 46 secs   0.4
Regular TBL     96.61%     2286 mins, 21 secs   1.0
Indexed TBL     96.61%     420 mins, 7 secs     5.4
FastTBL         96.61%     17 mins, 21 secs     131.7
ICA (Hepple)    96.23%     6 mins, 13 secs      367.8

Table 1: POS tagging: Evaluation and Running Times
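The time ratios in Table 1 are the running time of the regular TBL divided by the running time of each system. As a worked check (a sketch; the dictionary and helper names are illustrative), the ratios can be recomputed from the reported running times:

```python
def seconds(minutes, secs):
    return 60 * minutes + secs


# Running times as reported in Table 1.
running_time = {
    "Brill's tagger": seconds(5879, 46),
    "Regular TBL": seconds(2286, 21),
    "Indexed TBL": seconds(420, 7),
    "FastTBL": seconds(17, 21),
    "ICA (Hepple)": seconds(6, 13),
}

baseline = running_time["Regular TBL"]
time_ratio = {name: baseline / t for name, t in running_time.items()}

for name, ratio in time_ratio.items():
    print(f"{name:15s} {ratio:8.2f}")
```

The recomputed values agree with the table to within a tenth; the same calculation applies to the running times reported for the other two tasks.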





System          Accuracy   Running time         Time ratio
Regular TBL     81.0%      190 mins, 19 secs    1.0
Indexed TBL     81.0%      65 mins, 50 secs     2.9
Fast TBL        81.0%      14 mins, 38 secs     13
ICA (Hepple)    77.8%      4 mins, 1 sec        47.4

Table 2: PP Attachment: Evaluation and Running Times



4.2 Prepositional Phrase Attachment 

Prepositional phrase attachment is the task of decid- 
ing the point of attachment for a given prepositional 
phrase (PP). As an example, consider the following 
two sentences: 

1. I washed the shirt with soap and water. 

2. I washed the shirt with pockets. 

In Sentence 1, the PP "with soap and water" describes
the act of washing the shirt. In Sentence 2,
however, the PP "with pockets" is a description for
the shirt that was washed.

Most previous work has concentrated on situations
which are of the form VP NP1 P NP2. The
problem is cast as a classification task, and the sentence
is reduced to a 4-tuple containing the preposition
and the non-inflected base forms of the head
words of the verb phrase VP and the two noun
phrases NP1 and NP2. For example, the tuples
corresponding to the two above sentences would be:

1. wash shirt with soap 

2. wash shirt with pocket 

Many approaches to solving this problem have
been proposed, most of them using standard machine
learning techniques, including transformation-based
learning, decision trees, maximum entropy
and backoff estimation. The transformation-based
learning system was originally developed by Brill
and Resnik (1994).

The data used in the experiment consists of approximately
13,000 quadruples (VP NP1 P NP2)
extracted from Penn Treebank parses. The set is
split into a test set of 500 samples and a training set
of 12,500 samples. The templates used to generate
rules are similar to the ones used by Brill and Resnik
(1994) and some include WordNet features. All the
systems were trained until no more rules could be
learned.

Table 2 shows the results of the experiments.
Again, the ICA algorithm learns the rules very fast,
but has a slightly lower performance than the other
TBL systems. Since the samples are inherently
independent, there is no performance loss because
of the independence assumption; therefore the
performance penalty has to come from the commitment
assumption. The Fast TBL algorithm runs, again,
an order of magnitude faster than the original TBL
while preserving the performance; the time ratio is
only 13 in this case due to the small training size
(only 13,000 samples).

4.3 Text Chunking 

Text chunking is a subproblem of syntactic pars- 
ing, or sentence diagramming. Syntactic parsing at- 
tempts to construct a parse tree from a sentence by 
identifying all phrasal constituents and their attach- 
ment points. Text chunking simplifies the task by 
dividing the sentence into non-overlapping phrases, 
where each word belongs to the lowest phrasal con- 
stituent that dominates it. The following exam- 
ple shows a sentence with text chunks and part-of- 
speech tags: 

[NP A.P._NNP Green_NNP ] [ADVP currently_RB ]
[VP has_VBZ ] [NP 2,664,098_CD shares_NNS ]
[ADJP outstanding_JJ ] .

The problem can be transformed into a classification
task. Following Ramshaw & Marcus' (1999) work in
base noun phrase chunking, each word is assigned
a chunk tag corresponding to the phrase to which
it belongs. The following table shows the above
sentence with the assigned chunk tags:



Word          POS tag   Chunk Tag
A.P.          NNP       B-NP
Green         NNP       I-NP
currently     RB        B-ADVP
has           VBZ       B-VP
2,664,098     CD        B-NP
shares        NNS       I-NP
outstanding   JJ        B-ADJP
.             .         O
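The conversion from bracketed chunks to per-word chunk tags is mechanical: the first word of a chunk receives a `B-` tag, the remaining words receive `I-` tags, and words outside any chunk receive `O`. A minimal sketch, assuming chunks are given as hypothetical `(label, words)` pairs with `None` as the label for material outside any chunk:

```python
def chunk_tags(chunks):
    """Turn [(label, [words...]), ...] into one chunk tag per word."""
    tags = []
    for label, words in chunks:
        for k, _ in enumerate(words):
            if label is None:
                tags.append("O")              # outside any chunk
            elif k == 0:
                tags.append(f"B-{label}")     # first word of the chunk
            else:
                tags.append(f"I-{label}")     # inside the chunk
    return tags


sentence = [
    ("NP", ["A.P.", "Green"]),
    ("ADVP", ["currently"]),
    ("VP", ["has"]),
    ("NP", ["2,664,098", "shares"]),
    ("ADJP", ["outstanding"]),
    (None, ["."]),
]
print(chunk_tags(sentence))
```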



The data used in this experiment is the CoNLL-2000
phrase chunking corpus (Tjong Kim Sang and
Buchholz, 2000). The training corpus consists of
sections 15-18 of the Penn Treebank (Marcus et al.,
1993); section 20 was used as the test set. The chunk
tags are derived from the parse tree constituents,
and the part-of-speech tags were generated by Brill's
tagger (Brill, 1995). All the systems are trained to
completion (until all the rules are learned).

System          F-measure   Running time           Time ratio
Regular TBL     92.30       19211 mins, 40 secs    1.0
Indexed TBL     92.30       2056 mins, 4 secs      9.3
Fast TBL        92.30       137 mins, 57 secs      139.2
ICA (Hepple)    86.20       12 mins, 40 secs       1516.7

Table 3: Text Chunking: Evaluation and Running Times

Table 3 shows the results of the text chunking
experiments. The performance of the FastTBL algorithm
is the same as that of regular TBL, and it runs about
139 times faster. The ICA algorithm again
runs considerably faster, but at the cost of a significant
performance hit. There are at least 2 reasons
that contribute to this behavior:

1. The initial state has a lower performance than 
the one in tagging; therefore the independence 
assumption might not hold. 25% of the samples 
are changed by at least one rule, as opposed to 
POS tagging, where only 2.5% of the samples 
are changed by a rule. 

2. The commitment assumption might also not 
hold. For this task, 20% of the samples that 
were modified by a rule are also changed again 
by another one. 

4.4 Training Data Size Scalability

A question usually asked about a machine learning
algorithm is how well it adapts to larger amounts
of training data. Since the performance of the Fast
TBL algorithm is identical to that of regular TBL,
the issue of interest is the dependency between the
running time of the algorithm and the amount of
training data.

The experiment was performed with the part-of-speech
data set. The four algorithms were trained
on training sets of different sizes; training times were
recorded and averaged over 4 trials. The results are
presented in Figure 2(a). The Fast
TBL algorithm is much more scalable than the regular
TBL, displaying a linear dependency on the
amount of training data, while the regular TBL has
an almost quadratic dependency. The explanation
for this behavior has been given in Section 3.3.

Figure 2(b) shows the time spent at each iteration
versus the iteration number, for the original TBL
and fast TBL systems. It can be observed that the
time taken per iteration increases dramatically with
the iteration number for the regular TBL, while for
the FastTBL, the situation is reversed. The consequence
is that, once a certain threshold has been
reached, the incremental time needed to train the
FastTBL system to completion is negligible.

5 Conclusions

We have presented in this paper a new and improved
method of computing the objective function
for transformation-based learning. This method allows
a transformation-based algorithm to train an
observed 13 to 139 times faster than the original
one, while preserving the final performance of the
algorithm. The method was tested in three different
domains, each one having different characteristics:
part-of-speech tagging, prepositional phrase attachment
and text chunking. The results obtained
indicate that the algorithmic improvement generated
by our method is not linked to a particular
task, but extends to any classification task where
transformation-based learning can be applied.
Furthermore, our algorithm scales better with training
data size; therefore the relative speed-up obtained
will increase when more samples are available for
training, making the procedure a good candidate for
large corpora tasks.

The increased speed of the Fast TBL algorithm
also enables its usage in higher level machine learning
algorithms, such as adaptive boosting, model
combination and active learning. Recent work (Florian
et al., 2000) has shown how a TBL framework
can be adapted to generate confidences on the
output, and our algorithm is compatible with that
framework. The stability, resistance to overtraining,
the existence of probability estimates and, now,
reasonable speed make TBL an excellent candidate for
solving classification tasks in general.